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ABSTRACT 

Summary: We present the first public release of our proteogenomic 
annotation pipeline. We have previously used our original unreleased 
implementation to improve the annotation of 46 diverse prokaryotic 
genomes by discovering novel genes, post-translational modifications 
and correcting the erroneous annotations by analyzing proteomic 
mass-spectrometry data. 

This public version has been redesigned to run in a wide range of 
parallel Linux computing environments and provided with the auto- 
mated configuration, build and testing facilities for easy deployment 
and portability. 

Availability and implementation: Source code is freely available from 
https://bitbucket.org/andreyto/proteogenomics under GPL license. It 
is implemented in Python and C++. It bundles the Makeflow engine to 
execute the workflows. 
Contact: atovtchi@jcvi.org 
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1 INTRODUCTION 

Our pipeline is a tool for improving the existing genomic anno- 
tations from available proteomics mass spectrometry data. 
As most genome annotation pipelines consist of automated 
gene finding, they lack experimental validation of primary struc- 
ture (Aziz et al., 2008; Markowitz et al., 2008), having to rely on 
DNA centric sources of data such as sequence homology, tran- 
scriptome mapping, codon frequency, etc. By incorporating the 
orthogonal set of data, proteogenomics is able to discover novel 
genes, post-translational modifications and correct the erroneous 
primary sequence annotations. 

The protocol and the large-scale application of our original 
pipeline to 46 taxonomically diverse genomes were reported in 
Venter et a!. (2011). The implementation was tightly coupled 
with the internal computation services framework (VICS) at 
the J. Craig Venter Institute (JCVI). VICS has never been de- 
ployed outside of the JCVI, and the pipeline itself required 
manual configuration and building by the developers. It could 
only use Sun Grid Engine (SGE) batch queuing system config- 
ured for high-throughput computing (HTC) mode in which large 
numbers of serial jobs could be efficiently scheduled on a 
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compute cluster. For these reasons, the original pipeline has 
not been made public. 

To create the first open source release presented here, we have 
redesigned the pipeline to run in a wide range of parallel Linux 
computing environments: 

• High-performance computing (HPC) clusters, which are set 
up to efficiently schedule only large (100s+ of cores) parallel 
Message Passing Interface (MPI) jobs under a control of 
batch queuing system such as Sun Grid Engine (SGE) and 
its clones. Simple Linux Utility for Resource Management 
(SLURM) or Portable Batch System (PBS)/Torque. Our 
primary targets for this use case were compute clusters of 
XSEDE (https://www.xsede.org/), the federation of super- 
computers supported by the US National Science 
Foundation. XSEDE allocates its resources to outside re- 
searchers through a peer-reviewed proposal system. The 
biologists will be able to use our software on this major 
computational resource. 

• High-throughput computing (HTC) clusters widely used as 
local bioinformatics computing resources. These clusters are 
configured to efficiently schedule large numbers of serial 
jobs under a control of batch queuing system. 

• A single multi-core workstation without a batch queuing 
system (including an extreme case of single-core machine). 

The volume of computations in proteogenomics is relatively 
high, with ~ 100 CPU hours for a typical bacterial genome. 
Our pipeline performs such annotation in ~3h of wall clock 
time on HTC cluster. 

We have now designed a fully automated installation proced- 
ure preconfigured for several types of specific target systems and 
easily adaptable to others through editing of a few configuration 
files. 

Although several other proteogenomic packages (Chapman 
et al., 2013; Kumar et al., 2013; Risk et al., 2013; Sanders 
et al., 2011) have been developed in recent years, they were de- 
signed for execution on a single workstation. None of the other 
publications matched the breadth of application reported for our 
pipeline in Venter et al. (20 II). 

The output files from that study are available at (http://omics. 
pnl.gov/pgp/overview.php). The contributed RefSeq updates can 
be seen in the Genbank flat files (.gbk) of the corresponding 
genomes at the NCBI wherever the proteomics data are listed 
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as experimental evidence. One example is the Mycohacterium 
tuberculosis H37Rv genome (ftp://ftp.ncbi.nih.gov/genomes/ 
Bacteria/Mycobacterium_tuberculosis_H37Rv_uid57777/NC_ 
000962.gbk) containing the CDS attributes/ 
experiment="EXISTENCE: identified in proteomics study". 



2 ARCHITECTURE AND IMPLEMENTATION 

2.1 Parallelization strategy 

In the present work, our main goal was to make tlie same pipe- 
line protocol portable across different parallel execution envir- 
onments that users are likely to encounter. The original 
algorithm is embarrassingly parallel for the most part. It pro- 
cesses each spectrum file independently throughout all computa- 
tionally intensive stages of the algorithm. There is a global 
synchronization point in the middle to build a histogram of all 
scores for P-value computation. Thus, the pipeUne corresponds 
to a distributed workflow where multiple serial processes are 
executed concurrently following a dependency graph defined 
by required input and output files. This model is compatible 
with a wide variety of execution environments such as standalone 
multicore machines, HTC clusters and, with extra effort, MPI 
clusters. The original unreleased implementation used HTC 
model tied into VICS and SGE. 

We have now achieved the portability across execution envir- 
onments by generating and running the same workflow under 
the Makeflow engine (http://nd.edu/~ccl/software/makeflow/) 
(Thrasher et al, 2010) that provides parallel execution on mul- 
tiple types of batch queuing systems as well as on standalone 
multicore nodes. On MPI clusters, Makeflow uses 'glide-in' ap- 
proach that we describe in PGP software manual. In short, the 
'glide-in' approach emulates an HTC cluster inside a single large 
MPI job. 

It will be also trivial to deploy our pipeline behind any Web 
services front-end such as Galaxy (Giardine et ah, 2005) or 
Taverna (Wolstencroft et al., 2013). Each run of the pipeline 
appears to the caller as a single command-line invocation of 
the entry point script that exits only once it finishes executing 
its parallel workflow. Backend options (batch queue or local 
multicore) are passed through the command arguments. No per- 
manently running server components are used by Makeflow. 
Deployment in Galaxy, for example, would be the same as de- 
ployment of a simple serial tool, requiring creation of a single 
XML tool description file. 



2.2 Installation and execution 

Newly developed installation procedure and documentation are 
part of the source code repository. The step-by-step installation 
and usage manual (also shown on the landing page at BitBucket) 
covers the execution environments, specific examples of config- 
uration files for each environment and instructions for adapting 
these files to new compute clusters. 

The manual also covers sample run-time parameters for dif- 
ferent environments. Quick Start instructions for testing the pipe- 
line on a small dataset included in the repository and example of 
interpreting the pipeline's output to discover a novel gene. The 
automated configuration and build procedure is driven by 
CMake (http://www.cmake.org/). Our installation procedure 
builds its own local copy of the Makeflow and several prote- 
omics tools from (http://proteomics.ucsd.edu). 
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